Co-SOFT-Clustering: An Information Theoretic Approach to Obtain Overlapping Clusters from Co-Occurrence Data
Abstract
Co-clustering exploits co-occurrence information from contingency tables to cluster both rows and columns simultaneously. It has been established that co-clustering produces a better clustering structure than conventional clustering methods. So far, co-clustering has only been used to produce hard clusters, which may be inadequate for applications such as document clustering. In this paper, we present an algorithm that uses the information-theoretic approach [1] to generate overlapping (soft) clusters. The algorithm maintains a membership probability for every instance in each of the possible clusters and iteratively tunes these membership values. The theoretical formulation of the criterion function is presented first, followed by the actual algorithm. We evaluate the algorithm on document/word co-occurrence information and present experimental results.

Introduction

Co-clustering is a relatively new clustering technique that looks at data contingency information (such as the co-occurrence of documents/words or students/courses) to cluster both sides in an iterative fashion. For example, co-clustering can be applied to movies and viewers simultaneously in a movie database to produce clusters of similar movies and clusters of users with similar interests. It has been established that clustering both sides simultaneously performs better than clustering either side independently of the other [1]. So far, this technique has only been applied to produce hard clusters. Hard clustering algorithms place each data instance into exactly one cluster. This level of detail, however, may not be sufficient in many cases, such as document clustering. We often come across documents that discuss multiple, seemingly unrelated topics. In such cases it becomes important, from a classification perspective, to place the document in multiple clusters that may belong to different topics.
We characterize the multi-cluster membership of an instance by maintaining a probability distribution that describes its presence in each of the possible clusters. The guideline for clustering proposed in [1] is to minimize the loss in mutual information between the original row/column contingency distribution and the compressed distribution in which multiple row (column) instances are clustered together. In this paper, we follow the same guideline and present a different criterion function that can be used for a soft clustering task.

Theory

We denote the normalized co-occurrence matrix by the probability distribution $P(x, y)$. Without loss of generality, let there be $k$ row clusters, $\hat{X} = \{\hat{x}_1, \hat{x}_2, \hat{x}_3, \ldots, \hat{x}_k\}$, and $l$ column clusters, $\hat{Y} = \{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_l\}$. With every row $x$ we associate a vector $C_k$ of size $k$, where $C_k^{\hat{x}}(x)$ is the probability that $x$ belongs to cluster $\hat{x}$:

\[ C_k^{\hat{x}}(x) = P(\hat{x} \mid x), \qquad \hat{x} \in \hat{X} \]

Similarly, we define an equivalent vector for each of the columns:

\[ C_l^{\hat{y}}(y) = P(\hat{y} \mid y), \qquad \hat{y} \in \hat{Y} \]

These vectors are initialized to random values at the start of the algorithm. Our goal is to approximate the original distribution $P(x, y)$ over the individual rows and columns by a compressed distribution over row and column clusters. We define the p.d.f. in the clustered space as follows:

\begin{align*}
P(\hat{x}, \hat{y}) &= \sum_{x} \sum_{y} P(\hat{x}, \hat{y} \mid x, y) \, P(x, y) \\
&= \sum_{x} \sum_{y} P(\hat{x} \mid x, y) \, P(\hat{y} \mid x, y, \hat{x}) \, P(x, y) \\
&= \sum_{x} \sum_{y} P(\hat{x} \mid x) \, P(\hat{y} \mid y) \, P(x, y)
\end{align*}

The last step takes the row-cluster membership to depend only on the row $x$ and the column-cluster membership to depend only on the column $y$.
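As a rough illustration of the definitions above, the following pure-Python sketch normalizes a small count table into $P(x, y)$, draws random membership vectors $P(\hat{x} \mid x)$ and $P(\hat{y} \mid y)$, and evaluates the clustered-space distribution $P(\hat{x}, \hat{y})$. It also computes the loss in mutual information named as the guideline from [1]. The function names, the toy count table, the random seed, and the choice $k = l = 2$ are assumptions made for illustration and do not come from the paper. The sketch implements neither the paper's own criterion function nor its iterative tuning of the memberships.

```python
import math
import random


def normalize(counts):
    """Turn a raw co-occurrence count table into a joint distribution P(x, y)."""
    total = sum(sum(row) for row in counts)
    return [[c / total for c in row] for row in counts]


def random_memberships(n_items, n_clusters, rng):
    """One random probability vector per item; each vector sums to 1."""
    rows = []
    for _ in range(n_items):
        raw = [rng.random() + 1e-9 for _ in range(n_clusters)]
        s = sum(raw)
        rows.append([v / s for v in raw])
    return rows


def clustered_joint(p_xy, c_rows, c_cols):
    """P(x_hat, y_hat) = sum_x sum_y P(x_hat|x) P(y_hat|y) P(x, y)."""
    k, l = len(c_rows[0]), len(c_cols[0])
    q = [[0.0] * l for _ in range(k)]
    for x, row in enumerate(p_xy):
        for y, p in enumerate(row):
            if p == 0.0:
                continue  # zero cells contribute nothing to the sum
            for a in range(k):
                for b in range(l):
                    q[a][b] += c_rows[x][a] * c_cols[y][b] * p
    return q


def mutual_information(joint):
    """Mutual information (in nats) of a joint distribution given as nested lists."""
    pa = [sum(row) for row in joint]        # row marginal
    pb = [sum(col) for col in zip(*joint)]  # column marginal
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:
                mi += p * math.log(p / (pa[i] * pb[j]))
    return mi


# Hypothetical 4-document x 3-word count table (illustrative data only).
counts = [[4, 0, 1], [3, 1, 0], [0, 5, 2], [1, 4, 3]]
p_xy = normalize(counts)

rng = random.Random(0)                             # fixed seed, arbitrary choice
c_rows = random_memberships(len(p_xy), 2, rng)     # k = 2 row clusters (assumed)
c_cols = random_memberships(len(p_xy[0]), 2, rng)  # l = 2 column clusters (assumed)

p_hat = clustered_joint(p_xy, c_rows, c_cols)
loss = mutual_information(p_xy) - mutual_information(p_hat)
```

Because each membership vector depends only on its own row or column, compressing the table cannot increase mutual information, so `loss` is non-negative up to floating-point error. An iterative scheme in the spirit of the paper would then adjust `c_rows` and `c_cols` to reduce it.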
Publication date: 2008